UCP/CORE: Add rc_gda transport to alias list of ib #11170
yosefe merged 7 commits into openucx:master
ad9fbcc to ef85c3e
ColinNV left a comment
This fixes the NIXL test.
$ bin/gtest --gtest_filter="UcxHardwareWarningTest.*"
Note: Google Test filter = UcxHardwareWarningTest.*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from UcxHardwareWarningTest
[ RUN ] UcxHardwareWarningTest.WarnWhenGpuPresentButCudaNotSupported
W0205 18:58:18.931417 145837 ucx_utils.cpp:654] 1 NVIDIA GPU(s) were detected, but UCX CUDA support was not found! GPU memory is not supported.
[ OK ] UcxHardwareWarningTest.WarnWhenGpuPresentButCudaNotSupported (1490 ms)
[ RUN ] UcxHardwareWarningTest.WarnWhenIbPresentButRdmaNotSupported
W0205 18:58:19.090172 145837 ucx_utils.cpp:662] 3 IB device(s) were detected, but accelerated IB support was not found! Performance may be degraded.
[ OK ] UcxHardwareWarningTest.WarnWhenIbPresentButRdmaNotSupported (155 ms)
[ RUN ] UcxHardwareWarningTest.NoWarningWhenIbAndCudaSupported
[ OK ] UcxHardwareWarningTest.NoWarningWhenIbAndCudaSupported (154 ms)
[----------] 3 tests from UcxHardwareWarningTest (1799 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1799 ms total)
[ PASSED ] 3 tests.
src/ucp/core/ucp_context.c (Outdated)
      { "dc_x", { "dc_mlx5", UCP_TL_AUX("ud_mlx5"), NULL } },
      { "ugni", { "ugni_smsg", UCP_TL_AUX("ugni_udt"), "ugni_rdma", NULL } },
-     { "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", NULL } },
+     { "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", "rc_gda", NULL } },
Not sure we also want it in the cuda alias. E.g. if I want to use UCX_TLS=rc,cuda, it will automatically enable rc_gda (which requires a lot of resources). Also not sure if we need it in the rc alias.
@ofirfarjun7, @Artemy-Mellanox wdyt?
I don't see it in the rc alias, what am I missing?
We need to disable it if the user disables ib/cuda. Do we have another conventional way to do that besides the aliases?
Regarding resources, with the latest implementation it creates an additional transport for each NIC (a NIC-GPU transport), but just one, not all combos (@Artemy-Mellanox please correct me if I'm wrong).
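To make the UCX_TLS=rc,cuda concern concrete, here is a minimal sketch of alias expansion. It is not the code in ucp_context.c: tl_alias_t, token_selects_tl, tls_list_selects and the contents of the "rc" entry are hypothetical and simplified (no UCP_TL_AUX handling). Only the two "cuda" entries mirror the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALIAS_TLS 8

/* Hypothetical, simplified alias entry (not the UCX structure). */
typedef struct {
    const char *alias;
    const char *tls[MAX_ALIAS_TLS]; /* NULL-terminated member list */
} tl_alias_t;

/* Variant under review: rc_gda is a member of the "cuda" alias.
 * The "rc" members are an assumption made for this illustration. */
static const tl_alias_t aliases_with_gda_in_cuda[] = {
    { "rc",   { "rc_verbs", "rc_mlx5", NULL } },
    { "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", "rc_gda", NULL } },
    { NULL,   { NULL } }
};

/* Variant without rc_gda in the "cuda" alias. */
static const tl_alias_t aliases_without_gda_in_cuda[] = {
    { "rc",   { "rc_verbs", "rc_mlx5", NULL } },
    { "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", NULL } },
    { NULL,   { NULL } }
};

/* True if one token of the list names `tl` directly or names an alias
 * that has `tl` as a member. */
static bool token_selects_tl(const tl_alias_t *aliases, const char *name,
                             size_t name_len, const char *tl)
{
    const tl_alias_t *a;
    const char *const *member;

    if ((strlen(tl) == name_len) && (strncmp(name, tl, name_len) == 0)) {
        return true;
    }

    for (a = aliases; a->alias != NULL; a++) {
        if ((strlen(a->alias) != name_len) ||
            (strncmp(a->alias, name, name_len) != 0)) {
            continue;
        }
        for (member = a->tls; *member != NULL; member++) {
            if (strcmp(*member, tl) == 0) {
                return true;
            }
        }
    }
    return false;
}

/* True if the comma-separated list selects `tl`. */
static bool tls_list_selects(const tl_alias_t *aliases, const char *list,
                             const char *tl)
{
    const char *p = list;

    assert((list != NULL) && (tl != NULL));
    while (*p != '\0') {
        size_t len = strcspn(p, ",");
        if (token_selects_tl(aliases, p, len, tl)) {
            return true;
        }
        p += len;
        if (*p == ',') {
            p++;
        }
    }
    return false;
}
```

With the first table, tls_list_selects(aliases_with_gda_in_cuda, "rc,cuda", "rc_gda") is true, which is the implicit enablement described above. With the second table the same call is false, and rc_gda is selected only when named explicitly.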
src/ucp/core/ucp_context.c (Outdated)
      { "dc_x", { "dc_mlx5", UCP_TL_AUX("ud_mlx5"), NULL } },
      { "ugni", { "ugni_smsg", UCP_TL_AUX("ugni_udt"), "ugni_rdma", NULL } },
-     { "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", NULL } },
+     { "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", "rc_gda", NULL } },
IMO it should be only for IB
Do we agree that
Removed from the
Title changed from "rc_gda transport to aliases list of ib and cuda" to "rc_gda transport to alias list of ib"
2e67fc2
Maybe we want to do the fd leak fix in a separate PR? Could it also be helpful in v1.20?
73af919 to 4f078f1
What?
Add rc_gda transport to aliases list of ib and cuda as it depends on both.

Why?
The NIXL test UcxHardwareWarningTest.WarnWhenIbPresentButRdmaNotSupported failed because setting UCX_TLS=^ib did not disable InfiniBand completely: the rc_gda transport remained available, so the warning did not show when expected.

A temporary fix for the test will be pushed in parallel to let the tests pass.
ai-dynamo/nixl#1292
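
The failure mode with UCX_TLS=^ib can be sketched as a membership check: in exclude mode, a transport stays enabled unless it is a member of the negated alias. The lists and the helper enabled_when_alias_negated below are hypothetical; the members other than rc_gda and the position of rc_gda in the list are assumptions for illustration, not a copy of the UCX alias table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed member list of the "ib" alias before this change. */
static const char *const ib_before[] = {
    "rc_verbs", "ud_verbs", "rc_mlx5", "ud_mlx5", "dc_mlx5", NULL
};

/* Same list with rc_gda added, as this PR proposes. */
static const char *const ib_after[] = {
    "rc_verbs", "ud_verbs", "rc_mlx5", "ud_mlx5", "dc_mlx5", "rc_gda", NULL
};

/* Models UCX_TLS=^<alias>: `tl` remains enabled unless it is one of the
 * members of the negated alias. */
static bool enabled_when_alias_negated(const char *const *alias_members,
                                       const char *tl)
{
    assert((alias_members != NULL) && (tl != NULL));
    for (; *alias_members != NULL; alias_members++) {
        if (strcmp(*alias_members, tl) == 0) {
            return false;
        }
    }
    return true;
}
```

Under these assumptions, enabled_when_alias_negated(ib_before, "rc_gda") is true, matching the reported behavior where rc_gda survived UCX_TLS=^ib, while the same call with ib_after is false.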